Tradeoffs in Main-Memory Statistical Analytics from Impala to DimmWitted
Abstract
Recent years have seen a surge in main-memory SQL-style analytic solutions to quickly deliver business-critical information over massive data sets [1, 7, 14]. At the same time, there is an arms race to offer increasingly sophisticated statistical analytics inspired by the success of web search, voice recognition, and image analysis, e.g., Google Brain [8], Facebook [6], and Microsoft's Adam [2]. This talk describes the first author's experience porting statistical analytics to Impala via MADlib and observations about research for high-performance main-memory analytics that may be relevant for systems like Impala. A major motivation for Impala was to enable interactive SQL-analytics queries over data stored in Hadoop. Impala achieves high performance through many techniques, including co-location of computation with data in HDFS, LLVM code generation [13], and aggressive use of SIMD instructions. These optimizations allow Impala to achieve 8x the query throughput of Shark and Hive for queries in the TPC-DS benchmark [3], and a recent independent benchmark has shown that Impala is about 5 times faster than Hive on MapReduce for TPC-H queries on uncompressed data [10]. We also want high-performance statistical analytics in Impala without major changes to its infrastructure. We started with an approach popularized in MADlib, an existing package for in-RDBMS analytics [4]. We ported a subset of MADlib's statistical models to Impala [5], many of which use the Bismarck architecture [9], which allows statistical analytics via user-defined functions. In particular, the main algorithm is Stochastic Gradient Descent (SGD), a method that has a low memory footprint, converges rapidly, and is a near de facto standard for web-scale learning. SGD captures a wide variety of statistical models including Support Vector Machines (SVMs), Logistic Regression, and Matrix Factorization. 
Moreover, SGD's row-wise data access pattern matches the access pattern of User Defined Aggregates [9]. The port has received positive feedback from customers for its scalability, speed, and breadth of machine learning tasks. While the MADlib port enables some statistical analytics in Impala, it is only a first step: its data layout may be suboptimal, and it may not fully utilize commodity hardware. For example, SGD as we describe it can be viewed as a row-store access method, and it is natural to wonder whether there is a column-store equivalent. Indeed, there is a closely related algorithm called Stochastic Coordinate Descent (SCD). In our recent work, we have described asynchronous versions of both SGD [12] and SCD …
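The match between SGD's row-wise access and a user-defined aggregate can be illustrated with a minimal Python sketch. It is not the MADlib or Impala code: the function names (`initialize`, `transition`, `merge`, `finalize`, `train`), the choice of logistic regression, the fixed step size, and the use of row-count-weighted model averaging to combine partial states are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def initialize(num_features):
    """Create the aggregate state: zero weights and a count of rows seen."""
    return {"w": [0.0] * num_features, "n": 0}


def transition(state, features, label, step_size=0.1):
    """Consume one row and take one SGD step (label assumed to be 0 or 1)."""
    w = state["w"]
    z = sum(wi * xi for wi, xi in zip(w, features))
    err = sigmoid(z) - label  # gradient of the log loss with respect to z
    state["w"] = [wi - step_size * err * xi for wi, xi in zip(w, features)]
    state["n"] += 1
    return state


def merge(a, b):
    """Combine two partial states by row-count-weighted model averaging."""
    total = a["n"] + b["n"]
    if total == 0:
        return a
    w = [(a["n"] * wa + b["n"] * wb) / total for wa, wb in zip(a["w"], b["w"])]
    return {"w": w, "n": total}


def finalize(state):
    """Return the model held in the aggregate state."""
    return state["w"]


def train(partitions, num_features, epochs=20, step_size=0.1):
    """Run the aggregate once per epoch: scan each partition, then merge."""
    w = finalize(initialize(num_features))
    for _ in range(epochs):
        partials = []
        for rows in partitions:
            state = {"w": list(w), "n": 0}  # each scan starts from the current model
            for features, label in rows:
                state = transition(state, features, label, step_size)
            partials.append(state)
        merged = partials[0]
        for partial in partials[1:]:
            merged = merge(merged, partial)
        w = finalize(merged)
    return w
```

Each call to `transition` touches exactly one row, which is why the method fits the aggregate interface of a row-oriented scan; a coordinate-descent variant would instead read one column at a time.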
Similar resources
DimmWitted: A Study of Main-Memory Statistical Analytics
We perform the first study of the tradeoff space of access methods and replication to support statistical analytics using first-order methods executed in the main memory of a Non-Uniform Memory Access (NUMA) machine. Statistical analytics systems differ from conventional SQL-analytics in the amount and types of memory incoherence that they can tolerate. Our goal is to understand tradeoffs in ac...
P-V-L Deep: A Big Data Analytics Solution for Now-casting in Monetary Policy
The development of new technologies has confronted the entire domain of science and industry with issues of big data's scalability as well as its integration with the purpose of forecasting analytics in its life cycle. In predictive analytics, the forecast of near-future and recent past - or in other words, the now-casting - is the continuous study of real-time events and constantly updated whe...
Parasites of domestic and wild animals in South Africa. XV. The seasonal prevalence of ectoparasites on impala and cattle in the Northern Transvaal.
The prevalence of ectoparasites on a total of 36 impala (Aepyceros melampus) slaughtered monthly from February 1975 to February 1976 and a total of 24 cattle slaughtered monthly from March 1976 to March 1977 in the Nylsvley Provincial Nature Reserve was determined. Six species of ixodid ticks were collected from the impala and these, in order of abundance, were: Rhipicephalus evertsi evertsi, R...
Big Data Analytics and Now-casting: A Comprehensive Model for Eventuality of Forecasting and Predictive Policies of Policy-making Institutions
The ability of now-casting and eventuality is the most crucial and vital achievement of big data analytics in the area of policy-making. To recognize the trends and to render a real image of the current condition and alarming immediate indicators, the significance and the specific positions of big data in policy-making are undeniable. Moreover, the requirement for policy-making institutions to ...
Impala: A Modern, Open-Source SQL Engine for Hadoop
Cloudera Impala is a modern, open-source MPP SQL engine architected from the ground up for the Hadoop data processing environment. Impala provides low latency and high concurrency for BI/analytic read-mostly queries on Hadoop, not delivered by batch frameworks such as Apache Hive. This paper presents Impala from a user’s perspective, gives an overview of its architecture and main components and...
Publication date: 2014